Journal reference: Computer Networks and ISDN Systems, Volume 28, issues 7–11, p. 1015.
Link management is required whenever a page is modified in a way that affects the HyperText Markup Language (HTML) [1] links referring to that page. One simple example is whenever a page is moved, all the hypertext links to that page are broken until these links are updated. Other types of page modifications that can affect links include deleting, splitting, merging, and changing a page's topic.
Page modifications occur for a variety of reasons, including the restructuring of pages based on their growth and evolution, administrative changes related to disk space limitations and performance limitations, and page removal or archiving due to aging.
The absence of link management is a serious problem within the current World Wide Web (WWW). As Ingham et al. [2] aptly state:
"Such broken links are the single most annoying problem faced by browsing users in the current Web. Broken links result in a tarnished reputation for the provider of the document containing the link, annoyance for the document user, and possible lost opportunity for the owner of the resource pointed to by the link."
Unlike small hypertext systems [3], which tend to have bidirectional links, the WWW has unidirectional links, which, at large scale, make it impossible to know who references a specific resource--making repair of these references very difficult. To make matters worse, the exponential growth of the WWW--with ever-increasing amounts of content--leads to more and more links being created, and to more and more people making changes.
The small amount of link management support that currently exists in the WWW only covers updating browsers concerning moved resources. By manually updating a server configuration table, the HyperText Transfer Protocol (HTTP) [4] redirect directive provides a forwarding pointer to the new location of the moved resource. A browser (e.g., Netscape) attempting to access the old resource receives a redirect request to update its hotlists and other information. This minimal solution only handles the movement of resources, and only helps in updating browsers--it is not author-oriented and therefore does not repair any author's documents that point at the old resource location. There is also no way of determining when this forwarding reference can be removed, so another resource can never be moved into its place.
This paper describes a link management technique that helps authors ensure the consistency of their content. This technique, called the change log table/web-walk (CLT/WW) approach, is based on web operations that log the gross web changes made by authors.
Since our focus is on helping authors, Section 2 begins by considering the link management needs of the author. This includes understanding the common web operations authors perform, the information they require to repair link inconsistencies, the constraints placed on link management by the content management systems they use, and when they need to have links automatically repaired.
Section 3 introduces the CLT/WW approach, including the characteristics of this approach, and the requirements for integrating this approach into existing content management systems. The section ends with a simple one-site example of how this approach works. Section 4 expands the discussion to enterprise link management--managing the cross-site linking of two or more content sites that are part of an organization. The remaining sections present related work, plans for future work, and concluding remarks.
When automatic link repair is not possible, authors need to know what has changed as well as how it has changed. For example, an author needs to know that page A was deleted, instead of finding a broken hypertext link to A, or, even worse, not knowing that the link to A is broken. In order to specify how a page has changed, we must define some web operations that affect an author's links. To this end, we will complete this section by identifying five such common web operations that affect an author's links.
Automatic link repair is affected by the content management model because in order to do such repairs, the content must be changed "upstream" as far as possible in the content management process to avoid being re-released in its uncorrected form--potentially undoing link repairs that were just made. The best place to make such changes is in the author's originating source content, but this may not be possible. For example, if an author's source pages are stored on his PC, and he publishes pages by submitting them to a content management system, simply fixing the pages in the content management system is not enough, since we must also fix the source pages on his PC. When the originating source pages are not available, link repairs may simply consist of notifying an author of a needed change in her content and letting her manually change her source content at her convenience.
Note that movement of pages can take place within the same server (intra-server) and between servers (inter-server). Inter-server moves are more complex and have other complications, such as whether the appropriate Common Gateway Interface (CGI) [5] based scripts, server-side directives and includes, and correct versions of software exist to correctly serve up this page on the new server.
A less obvious approach, which avoids these problems, focuses on the pages themselves. This approach builds a table containing a log of the operations performed on the pages (not the links). We call this a change log table*3 (CLT). A CLT might contain operations like move A to A´, delete B, and change-title C from X to Y. The CLT is used in conjunction with a web-walk program*4 that walks the web structure(s) at a site. web-walk checks each HTML page for links*5 that refer to pages in the CLT, and, if found, notifies the appropriate person and potentially repairs links.
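As an illustration, the following is a minimal sketch of how a CLT and the per-page check might be represented, assuming a simple in-memory table keyed by URL. The names `CLTEntry`, `ChangeLogTable`, and `affected_links`, as well as the example URLs, are hypothetical and are not part of the CLT/WW design itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CLTEntry:
    url: str                      # page the operation was performed on
    operation: str                # e.g. "move", "delete", "change-title"
    extra: Optional[str] = None   # e.g. the destination of a move


class ChangeLogTable:
    """Hypothetical in-memory CLT: logs operations on pages, not on links."""

    def __init__(self) -> None:
        self._entries: Dict[str, CLTEntry] = {}

    def log(self, url: str, operation: str, extra: Optional[str] = None) -> None:
        self._entries[url] = CLTEntry(url, operation, extra)

    def lookup(self, url: str) -> Optional[CLTEntry]:
        return self._entries.get(url)


def affected_links(clt: ChangeLogTable, page_links: List[str]) -> List[CLTEntry]:
    """Return the CLT entries matched by the links found in one HTML page."""
    hits = []
    for link in page_links:
        entry = clt.lookup(link)
        if entry is not None:
            hits.append(entry)
    return hits


clt = ChangeLogTable()
clt.log("/A", "move", "/A-prime")
clt.log("/B", "delete")
clt.log("/C", "change-title", "X -> Y")

# Links extracted from some page during a web-walk (assumed input).
hits = affected_links(clt, ["/A", "/D", "/B"])
print([(e.url, e.operation) for e in hits])
```

Because the table is keyed by the changed page rather than by the links that reference it, logging an operation requires no knowledge of who points at the page; that knowledge is only needed later, during the walk.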
The major components of this approach are:
web-walk
link-manager
server (an HTTP-based server) to retrieve pages.
link-manager
link-manager performs three main operations:
link-repair to notify the author about the inconsistency and to potentially repair it.
link-manager. This table contains site web changes made since the last time the site's web-walk completed. The CLT houses the URL, operation performed, and any extra information required. Sample CLT entries are "A was deleted," and "B was moved to C."
link-repair
contact manager
link-repair that determines whom to notify about links that have been repaired and links that need to be repaired. It also supplies permission information that controls whether a link can be automatically repaired. This component is assumed to exist in the content management environment.
web-walk. This web operation/link repair independence is essential for ensuring the scalability of this approach when handling multiple autonomous sites (see Section 4). This approach decouples content update from the complex and slow operation of fixing links. This is especially true when updating content cannot be done automatically, and the author is not available for a long time (e.g., an author goes on vacation).
web-walk. In this case, we must wait until the next scheduled web-walk to consider these pages. This ensures that no needed repairs or notifications are missed because the web-walk had partially completed before these pages were published. Note that we cannot solve this problem by simply waiting until the next web-walk completes, because this introduces the problem of renotifying authors and refixing links for the link references that were not missed by the previous web-walk. To solve this problem, any requests to the link-manager to add CLT information during the web-walk are queued until the web-walk completes.
Queuing is easily accomplished by marking the CLT entry as "new," if a web-walk is in progress, and ignoring "new" entries during a web-walk. Note that we cannot simply lock out publishing during a web-walk, since a web-walk may take a considerable amount of time.
Tracked web operations must be saved until content is actually published. This is only required when content is not immediately published, as in a content management system where authors test their changes in a staging area before publishing their changes. No saving is necessary if web operations are performed directly on the content being served. Instead, this information can be directly passed to the link-manager.
link-manager.
web-walk. This requires access to the root page(s) for the site, and a content administrator (or some time-based process) that starts web-walk.
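The queuing of CLT entries added during a web-walk, described above, might be sketched as follows. This is only an illustration under stated assumptions: the `LinkManager` class and its method names are hypothetical, entries are plain dictionaries, and entries that have been through one full walk are simply dropped when that walk ends (consistent with the CLT holding only changes made since the last completed web-walk).

```python
class LinkManager:
    """Hypothetical single-site link-manager that defers 'new' CLT entries."""

    def __init__(self):
        self.entries = []              # each entry: {"url", "operation", "new"}
        self.walk_in_progress = False

    def begin_walk(self):
        self.walk_in_progress = True

    def add_entry(self, url, operation):
        # An entry added while a web-walk is running is marked "new" so the
        # partially completed walk does not see it.
        self.entries.append(
            {"url": url, "operation": operation, "new": self.walk_in_progress}
        )

    def lookup(self, url):
        # "new" entries are ignored until the next full web-walk.
        for entry in self.entries:
            if entry["url"] == url and not entry["new"]:
                return entry
        return None

    def end_walk(self):
        # Entries covered by this full walk are dropped; "new" entries are
        # unmarked and become visible to the next walk.
        self.entries = [dict(e, new=False) for e in self.entries if e["new"]]
        self.walk_in_progress = False


lm = LinkManager()
lm.add_entry("/A", "delete")          # logged before the walk starts
lm.begin_walk()
lm.add_entry("/B", "delete")          # published during the walk
during = (lm.lookup("/A") is not None, lm.lookup("/B") is not None)
lm.end_walk()
after = (lm.lookup("/A") is not None, lm.lookup("/B") is not None)
print(during, after)
```

Publishing is never blocked in this sketch; only the visibility of the new entry is deferred, which matches the reasoning that a web-walk may take too long to lock out authors.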
link-manager, which adds it to the CLT. If a web-walk is taking place when this item is added, the item is marked as new and not processed by the current web-walk.
web-walk is initiated. It traverses the local web structure, starting from all the root pages on this site.
web-walk takes a root page's URL and passes it to the link-manager to return the contents of this root. Links to other pages at this site are found within this root page, and web-walk calls itself recursively, handing itself the URL of the destination for each of these links.*6
When link-manager receives a request for a URL, it first looks up the URL in the CLT, searching for a source argument that matches this URL. If the URL is found in the CLT (as in looking up page A from within page B):
link-manager passes this URL (A), the URL of the referencing page (B), and any other information from the CLT to link-repair.
link-repair contacts the contact manager to find the contact person for the reference page (B) and permission information for performing automatic link repair. With this information, link-repair notifies the appropriate person about the inconsistency, and may also automatically fix the links in the reference page (B), depending on the permission information and the type of the inconsistency.
If no entry for the URL is found within the CLT:
link-manager calls the site's HTTP server with the URL, and the results of this request are passed back to web-walk.
link-manager returns either the contents of a page, or information about the status of a page. Status information is returned when a page was deleted, split into several pages, or cannot be found. The contents of the page are returned for all other cases. When a page was moved, or merged into another page, this new page's contents are returned. For example, if page A were moved to A´, then the contents of A´ would be returned.
web-walk continues recursively scanning the HTML pages returned to it, as long as these pages haven't been seen before and the pages are within the site. web-walk handles replies from the HTTP server (passed back through the link-manager), such as the 404 response (Not Found), as well as link-manager information ("page deleted").
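The single-site walk described above can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: the site's HTTP server is replaced by a hypothetical in-memory dictionary (`SITE`), the CLT models only move and delete operations, `link_repair` merely records a notification, and all function names and URLs are assumptions.

```python
import re

# Hypothetical stand-in for the site's HTTP server: URL -> HTML contents.
SITE = {
    "/root": '<a href="/B">B</a> <a href="/C">C</a>',
    "/B": '<a href="/A">A</a>',
    "/C": '<a href="/gone">gone</a> <a href="/A">A</a>',
    "/A2": "moved page, no links",
}
# Hypothetical CLT: URL -> (operation, extra information).
CLT = {"/A": ("move", "/A2"), "/gone": ("delete", None)}

notifications = []


def link_repair(url, referrer, entry):
    # A real link-repair would consult the contact manager and possibly
    # rewrite the referencing page; here we only record the inconsistency.
    notifications.append((referrer, url, entry[0]))


def link_manager(url, referrer):
    """Return (status, final_url, contents); contents is None for status-only replies."""
    entry = CLT.get(url)
    if entry is not None:
        link_repair(url, referrer, entry)
        if entry[0] == "move":
            return ("ok", entry[1], SITE.get(entry[1]))   # contents of the new page
        return ("page deleted", url, None)                # only delete is modeled here
    if url in SITE:
        return ("ok", url, SITE[url])
    return ("404 Not Found", url, None)


def web_walk(url, referrer=None, seen=None):
    seen = set() if seen is None else seen
    # Every link is looked up, even if its target was already scanned,
    # so that each referencing page is reported.
    status, final_url, contents = link_manager(url, referrer)
    if contents is None or final_url in seen:
        return seen
    seen.add(final_url)
    for link in re.findall(r'href="([^"]+)"', contents):
        web_walk(link, final_url, seen)
    return seen


visited = web_walk("/root")
print(sorted(visited))
print(notifications)
```

Note that the CLT lookup happens before the already-seen check, so both pages that reference the moved page A are reported even though A´ is scanned only once.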
With the CLT/WW approach, content for an enterprise is treated in a fashion similar to content for one site, with a few major differences:
An enterprise-manager server is needed that tracks what sites embody the enterprise. It notifies site link-managers about changes to the sites in the enterprise and is used by these link-managers to determine what other sites exist. When a web-walk begins, the link-manager at that site must first contact the enterprise-manager to see if any site information has changed. Note that if a single point of failure is a major concern, replication of the enterprise-manager may be needed.
web-walk must now access the first level of pages outside the site (treated as leaf nodes). It contacts the link-managers at other sites to retrieve off-site page information.
link-manager at the site pointed to by the link must return an update request, along with its normal return value, to the calling web-walk. When the calling web-walk receives this request, it passes the request to its own site's link-manager. Then, the link-manager, with link-repair, determines the appropriate action.
web-walk to the link-managers at all the other sites. Each site must keep track of whether another site is in the middle of performing a web-walk.
Table 1 Sample Multi-Site CLT Entry
web-walk by each and every site. Considering an entry for less than one full web-walk can lead to missed repairs and notifications--as described in the single site case. This presents a problem when CLT entries are added to a site concurrent with offsite web-walks that are looking at links to this site. What is required is that each site must track all web-walks that are in progress. When a CLT entry is added to a site, it must be marked as a "new entry" for all sites that have a web-walk in progress. While marked as new, this entry will be ignored by the link-manager in lookups requested by the sites that have a web-walk in progress. For example, if a CLT entry at Site #C was marked as being a "new entry" for Sites #M and #N, any lookups by Site #C's link-manager, in answering web-walk requests, will ignore this entry if the requesting sites are #M or #N.
An entry will be unmarked as new for a particular site, when this site's web-walk completes. In Table 1, the entry shown was added while Site #1's web-walk was in progress.
web-walk by each and every site. Since some sites may perform two or more web-walks before a CLT entry is removed, we must ensure that no attempts are made to refix links and renotify authors on these sites. To eliminate this problem, all non-new CLT entries are marked as "web-walk completed", when a site completes its web-walk. When the CLT is searched a second time by the same site, all entries marked as completed for that site are ignored. In Table 1, the entry shown has had a web-walk completed by Site #n. A CLT entry can be removed when all of its site entries are marked as "web-walk completed."
web-walks. To solve this problem, when a site is changed in one of these ways, it must send a message to the enterprise-manager, which relays it to all the enterprise sites. Each site then receives an add-site, remove-site, or change-site message. When an add-site message is received, a new site is added to each CLT entry. CLT entries will not be removed until this new site performs a web-walk. When a remove-site message is received, it removes that site from each CLT entry. A check is then performed to see if any CLT entries can be deleted. When a change-site message is received, the site registers an internal name change for referring to this site.
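The handling of add-site, remove-site, and change-site messages described above might look roughly like the following sketch. It assumes each CLT entry carries a per-site status dictionary; the function name `handle_site_message`, the status label "pending" (an entry the site has not yet walked), and the site names are hypothetical.

```python
def handle_site_message(entries, message, site, new_name=None):
    """Apply an enterprise-manager message to every CLT entry.

    entries maps URL -> {site_name: status}; status is one of the assumed
    labels "pending", "new", or "completed". Returns the URLs whose entries
    were deleted as a result of the message.
    """
    if message not in ("add-site", "remove-site", "change-site"):
        raise ValueError("unknown message: " + message)
    deleted = []
    for url in list(entries):
        status = entries[url]
        if message == "add-site":
            # The entry cannot be removed until the new site performs a web-walk.
            status[site] = "pending"
        elif message == "remove-site":
            status.pop(site, None)
            # Check whether the entry can now be deleted.
            if all(state == "completed" for state in status.values()):
                del entries[url]
                deleted.append(url)
        else:  # change-site: only the internal name used for the site changes
            if site in status:
                status[new_name] = status.pop(site)
    return deleted


entries = {"/A": {"site1": "completed", "site2": "pending"}}
handle_site_message(entries, "add-site", "site3")
handle_site_message(entries, "change-site", "site3", new_name="site3b")
first = handle_site_message(entries, "remove-site", "site2")
second = handle_site_message(entries, "remove-site", "site3b")
print(first, second, entries)
```

In this sketch, removing the last site that had not yet walked an entry is what allows the entry to be deleted, mirroring the check performed after a remove-site message.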
Figure 2 Content across multiple sites--across-site linking example, moving page A to A´
In this example, we have authors Andra, Rob, and Mike. Site #1 contains Andra's page A, and Mike's page C, that points at A. Site #2 houses Rob's page B, which also points at A (cross-site link). This example will show Andra moving A to A´, followed by a web-walk of Site #1, and then a web-walk of Site #2:
link-manager to add her change as a new CLT entry. If a web-walk were in progress on either Site #1 or Site #2, the appropriate column of this entry would be marked as a "new entry." However, no web-walk is in progress.
Sometime later:
web-walk program is initiated. It begins by checking with the enterprise-manager to see if any changes have occurred to the number or names of other sites (assume none have occurred). It then broadcasts to all link-managers that it has begun a web-walk. The Site #1 and Site #2 link-managers log this fact in order to catch any concurrently added CLT entries.
web-walk begins walking Site #1's content. For local pages, Site #1's link-manager is called to return pages. For a nonlocal page, web-walk attempts to contact the link-manager at the site where the page resides. Assume that web-walk is now looking at Mike's page C and investigating the link to page A.
link-manager is called to look up page A, referenced by page C. It finds the CLT entry for A moving to A´. It calls link-repair, which contacts Mike stating that page C's reference to page A was changed to page A´ (since Mike allows links to be automatically repaired, link-repair changes the link reference in page C). The link-manager returns A´ to Site #1's web-walk.
web-walk finishes and then broadcasts to all link-managers that it has ended its web-walk. Site #1's link-manager marks all non-new CLT entries as "web-walk completed" for Site #1 (this will mark our one entry, move A to A´). The link-manager then checks if any CLT entries can be deleted. Our one entry cannot be deleted, since Site #2 has not yet performed a web-walk.
Sometime later:
web-walk program is initiated. It begins by checking with the enterprise-manager to see if any changes have occurred to the number or names of other sites (assume none have occurred). It then broadcasts to all link-managers that it has begun a web-walk; the Site #1 and Site #2 link-managers log this fact.
web-walk begins walking Site #2's content. Assume it is now looking at Rob's page B and investigating the cross-site link to page A.
link-manager is called to look up page A, referenced by page B. It finds the matching CLT entry (move A) and returns A´ to Site #2's web-walk. It also returns an update request, stating that B's link to A should be changed to A´. Upon receiving this request, Site #2's web-walk passes it to the Site #2 link-manager and continues walking the web, investigating page A´.
link-manager calls Site #2's link-repair, which contacts Rob, stating that page B's link to page A should be changed to page A´ (Rob forbids automatic link repair to his pages).
web-walk finishes and broadcasts to all link-managers that it has ended its web-walk. Site #1's link-manager marks its one CLT entry as "web-walk completed" for Site #2. This entry is then deleted, since all sites have performed a web-walk for this entry.
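The CLT bookkeeping in this example can be replayed with a small sketch. It assumes Site #1's CLT keeps one status per enterprise site for each entry; the class `SiteCLT`, the status label "pending", and the identifiers `site1`, `site2`, and `A-prime` are hypothetical stand-ins, and notification and link repair are omitted.

```python
class SiteCLT:
    """Hypothetical CLT held by one site, with a status column per enterprise site."""

    def __init__(self, sites):
        self.sites = list(sites)
        self.walking = set()      # sites that have broadcast a web-walk in progress
        self.entries = {}         # url -> {"op", "dest", "status": {site: state}}

    def begin_walk(self, site):
        self.walking.add(site)

    def add_entry(self, url, op, dest=None):
        # Marked "new" for every site whose web-walk is currently in progress.
        status = {s: ("new" if s in self.walking else "pending") for s in self.sites}
        self.entries[url] = {"op": op, "dest": dest, "status": status}

    def lookup(self, url, requesting_site):
        entry = self.entries.get(url)
        if entry is None or entry["status"][requesting_site] != "pending":
            return None           # no entry, or "new"/"completed" for this site
        return entry

    def end_walk(self, site):
        self.walking.discard(site)
        for url in list(self.entries):
            status = self.entries[url]["status"]
            # "new" entries are unmarked; all others become "web-walk completed".
            status[site] = "pending" if status[site] == "new" else "completed"
            if all(state == "completed" for state in status.values()):
                del self.entries[url]


clt = SiteCLT(["site1", "site2"])        # Site #1's CLT
clt.add_entry("A", "move", "A-prime")    # Andra moves A; no web-walk in progress
log = []

clt.begin_walk("site1")
hit = clt.lookup("A", "site1")           # Mike's page C links to A
log.append(("C", hit["dest"]))
clt.end_walk("site1")
log.append(("entry kept", "A" in clt.entries))

clt.begin_walk("site2")
hit = clt.lookup("A", "site2")           # Rob's page B links to A (cross-site)
log.append(("B", hit["dest"]))
clt.end_walk("site2")
log.append(("entry kept", "A" in clt.entries))
print(log)
```

As in the walkthrough, the entry survives Site #1's web-walk and is deleted only after Site #2's web-walk has also completed.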
Web validation tools, such as MOMspider [10] and Netscape's LiveWire Site Manager [13], can aid link management tasks by finding broken links, and helping administrators delete or change all the links at a site that point at a specific page. However, they do not tell authors how a page has changed. This means that an author can determine that a document is inaccessible, but will not be informed that the referenced document was deleted. These tools are also limited in that they only manage a single site; they do not handle cross-site linking within an enterprise.
The work of the Internet Engineering Task Force (IETF) on Uniform Resource Names (URNs) [14] defines a logical naming scheme that allows resources to be seamlessly moved, gives easy access to meta-information about a resource (through their closely associated URCs [15]), and allows resources to simultaneously exist in many locations on the Internet. Unfortunately, URNs have their limitations and will not be completed and adopted for some time to come. For example, URNs do not handle page deletion, merging, splitting, or page title changes. Furthermore, URNs add complexity to the generation of page names (pages must be cleared by a naming authority), their name resolution is slower and more complex, and they are "opaque"--users cannot look directly at a URN and gain any insight about its meaning.
web-walks; (5) the effect of large numbers of pages referencing one page; (6) the rate at which web-walks can be performed, based on how much content is contained within a site; and (7) the "brittleness" of this approach, based on server and component failure.
Some other areas for research and extension involve:
web-walks.
enterprise-manager so that it stores useful meta information for improving enterprise link management, such as content management policies, and site web-walk scheduling.
link-manager to return a result.
[2] Ingham, D., Little, M., Caughey, S., and Shrivastava, S. "W3Objects: Bringing Object-Oriented Technology to the Web." Fourth International World Wide Web Conference. Boston, Massachusetts, USA. Dec 1995. URL: http://www.w3.org/pub/Conferences/WWW4/Papers2/141/
[4] Berners-Lee, T., Fielding, R., and Frystyk, H. "Hypertext Transfer Protocol--HTTP/1.0". IETF Internet-draft, expires Aug 1996. Feb 1996. draft-ietf-http-v10-spec-05.html. URL: ftp://ietf.cnri.reston.va.us/internet-drafts/draft-ietf-http-v10-spec-05.txt
[5] The CGI Specification. URL: http://hoohoo.ncsa.uiuc.edu/cgi/interface.html
[6] Berners-Lee, T., et al. "Uniform Resource Locators (URL)." IETF, RFC: 1738. Dec 1994. URL: http://www.cis.ohio-state.edu/htbin/rfc/rfc1738.html
[7] Fielding, R. "Relative Uniform Resource Locators." IETF, RFC: 1808. Jun 1995. URL: http://www.cis.ohio-state.edu/htbin/rfc/rfc1808.html
[8] ChURL Home Page. URL: http://www-personal.engin.umich.edu/~yunke/scripts/churl/
[9] de Graaff, H. Checkbot. URL: http://dutifp.twi.tudelft.nl:8000/checkbot/
[10] Fielding, R. "Maintaining Distributed Hypertext Infostructures: Welcome to MOMSpider's Web." First International Conference on the World Wide Web. Geneva, Switzerland. May 1994. URL: http://www.ics.uci.edu/WebSoft/MOMspider/WWW94/paper.html
[11] NCSA HTTPd Home Page. URL: http://hoohoo.ncsa.uiuc.edu/
[12] Open Market, Inc. "WebServer Product Description." URL: http://www.openmarket.com/products/servers/server.htm
[13] Netscape Communications Corp. "NETSCAPE INTRODUCES NETSCAPE LIVEWIRE AND LIVEWIRE PRO VISUAL ONLINE DEVELOPMENT ENVIRONMENT." URL: http://www.netscape.com/newsref/pr/newsrelease41.html
[14] Sollins, K. and Masinter, L. "Functional Requirements for Uniform Resource Names." IETF, RFC: 1737. Dec 1994. URL: http://www.cis.ohio-state.edu/htbin/rfc/rfc1737.html
[15] Hamilton, M. "UNIFORM RESOURCE IDENTIFIERS & THE SIMPLE DISCOVERY PROTOCOL." Loughborough University of Technology (LUT CS-TR 985). June 95. URL: http://gizmo.lut.ac.uk/~martin/uris/uris.html
[16] Luotonen, A. and Altis, K. "World-Wide Web Proxies." First International Conference on the World Wide Web. Geneva, Switzerland. May 1994. URL: http://www1.cern.ch/PapersWWW94/luotonen.ps
web-walk will not find links in dynamically generated pages, such as those produced by CGI scripts.